<title>L7</title>
<html>
<head>
</head>
<body>

<h1>Locking</h1>

<p>Required reading: spinlock.c

<h2>Why coordinate?</h2>

<p>Mutual-exclusion coordination is an important topic in operating
systems, because many operating systems run on
multiprocessors. Coordination techniques protect variables that are
shared among multiple threads and updated concurrently.  These
techniques allow programmers to implement atomic sections so that one
thread can safely update the shared variables without having to worry
that another thread intervening.  For example, processes in xv6 may
run concurrently on different processors and in kernel-mode share
kernel data structures.  We must ensure that these updates happen
correctly.

<p>List and insert example:
<pre>

struct List {
  int data;
  struct List *next;
};

List *list = 0;

insert(int data) {
  List *l = new List;
  l->data = data;
  l->next = list;  // A
  list = l;        // B
}
</pre>

<p>What needs to be atomic? The two statements labeled A and B should
always be executed together, as an indivisible fragment of code.  If
two processors execute A and B interleaved, then we end up with an
incorrect list.  To see that this is the case, draw out the list after
the sequence A1 (statement executed A by processor 1), A2 (statement A
executed by processor 2), B2, and B1.

<p>How could this erroneous sequence happen? The varilable <i>list</i>
lives in physical memory shared among multiple processors, connected
by a bus.  The accesses to the shared memory will be ordered in some
total order by the bus/memory system.  If the programmer doesn't
coordinate the execution of the statements A and B, any order can
happen, including the erroneous one.

<p>The erroneous case is called a race condition.  The problem with
races is that they are difficult to reproduce.  For example, if you
put print statements in to debug the incorrect behavior, you might
change the time and the race might not happen anymore.

<h2>Atomic instructions</h2>

<p>The programmer must be able express that A and B should be executed
as single atomic instruction.  We generally use a concept like locks
to mark an atomic region, acquiring the lock at the beginning of the
section and releasing it at the end: 

<pre> 
void acquire(int *lock) {
   while (TSL(lock) != 0) ; 
}

void release (int *lock) {
  *lock = 0;
}
</pre>

<p>Acquire and release, of course, need to be atomic too, which can,
for example, be done with a hardware atomic TSL (try-set-lock)
instruction:

<p>The semantics of TSL are:
<pre>
   R <- [mem]   // load content of mem into register R
   [mem] <- 1   // store 1 in mem.
</pre>

<p>In a harware implementation, the bus arbiter guarantees that both
the load and store are executed without any other load/stores coming
in between.

<p>We can use locks to implement an atomic insert, or we can use
TSL directly:
<pre>
int insert_lock = 0;

insert(int data) {

  /* acquire the lock: */
  while(TSL(&insert_lock) != 0)
    ;

  /* critical section: */
  List *l = new List;
  l->data = data;
  l->next = list;
  list = l;

  /* release the lock: */
  insert_lock = 0;
}
</pre>

<p>It is the programmer's job to make sure that locks are respected.  If
a programmer writes another function that manipulates the list, the
programmer must must make sure that the new functions acquires and
releases the appropriate locks.  If the programmer doesn't, race
conditions occur.

<p>This code assumes that stores commit to memory in program order and
that all stores by other processors started before insert got the lock
are observable by this processor.  That is, after the other processor
released a lock, all the previous stores are committed to memory.  If
a processor executes instructions out of order, this assumption won't
hold and we must, for example, a barrier instruction that makes the
assumption true.


<h2>Example: Locking on x86</h2>

<p>Here is one way we can implement acquire and release using the x86
xchgl instruction:

<pre>
struct Lock {
  unsigned int locked;
};

acquire(Lock *lck) {
  while(TSL(&(lck->locked)) != 0)
    ;
}

release(Lock *lck) {
  lck->locked = 0;
}

int
TSL(int *addr)
{
  register int content = 1;
  // xchgl content, *addr
  // xchgl exchanges the values of its two operands, while
  // locking the memory bus to exclude other operations.
  asm volatile ("xchgl %0,%1" :
                "=r" (content),
                "=m" (*addr) :
                "0" (content),
                "m" (*addr));
  return(content);
}
</pre>

<p>the instruction "XCHG %eax, (content)" works as follows:
<ol>
<li> freeze other CPUs' memory activity
<li> temp := content
<li> content := %eax
<li> %eax := temp
<li> un-freeze other CPUs
</ol>

<p>steps 1 and 5 make XCHG special: it is "locked" special signal
  lines on the inter-CPU bus, bus arbitration

<p>This implementation doesn't scale to a large number of processors;
  in a later lecture we will see how we could do better.

<h2>Lock granularity</h2>

<p>Release/acquire is ideal for short atomic sections: increment a
counter, search in i-node cache, allocate a free buffer.

<p>What are spin locks not so great for?  Long atomic sections may
  waste waiters' CPU time and it is to sleep while holding locks. In
  xv6 we try to avoid long atomic sections by carefully coding (can
  you find an example?).  xv6 doesn't release the processor when
  holding a lock, but has an additional set of coordination primitives
  (sleep and wakeup), which we will study later.

<p>My list_lock protects all lists; inserts to different lists are
  blocked. A lock per list would waste less time spinning so you might
  want "fine-grained" locks, one for every object BUT acquire/release
  are expensive (500 cycles on my 3 ghz machine) because they need to
  talk off-chip.

<p>Also, "correctness" is not that simple with fine-grained locks if
  need to maintain global invariants; e.g., "every buffer must be on
  exactly one of free list and device list".  Per-list locks are
  irrelevant for this invariant.  So you might want "large-grained",
  which reduces overhead but reduces concurrency.

<p>This tension is hard to get right. One often starts out with
  "large-grained locks" and measures the performance of the system on
  some workloads. When more concurrency is desired (to get better
  performance), an implementor may switch to a more fine-grained
  scheme. Operating system designers fiddle with this all the time.

<h2>Recursive locks and modularity</h2>

<p>When designing a system we desire clean abstractions and good
  modularity. We like a caller not have to know about how a callee
  implements a particul functions.  Locks make achieving modularity
  more complicated.  For example, what to do when the caller holds a
  lock, then calls a function, which also needs to the lock to perform
  its job.

<p>There are no transparent solutions that allow the caller and callee
  to be unaware of which lokcs they use. One transparent, but
  unsatisfactory option is recursive locks: If a callee asks for a
  lock that its caller has, then we allow the callee to proceed.
  Unfortunately, this solution is not ideal either.

<p>Consider the following. If lock x protects the internals of some
  struct foo, then if the caller acquires lock x, it know that the
  internals of foo are in a sane state and it can fiddle with them.
  And then the caller must restore them to a sane state before release
  lock x, but until then anything goes.

<p>This assumption doesn't hold with recursive locking.  After
  acquiring lock x, the acquirer knows that either it is the first to
  get this lock, in which case the internals are in a sane state, or
  maybe some caller holds the lock and has messed up the internals and
  didn't realize when calling the callee that it was going to try to
  look at them too.  So the fact that a function acquired the lock x
  doesn't guarantee anything at all. In short, locks protect against
  callers and callees just as much as they protect against other
  threads.

<p>Since transparent solutions aren't ideal, it is better to consider
  locks part of the function specification. The programmer must
  arrange that a caller doesn't invoke another function while holding
  a lock that the callee also needs.

<h2>Locking in xv6</h2>

<p>xv6 runs on a multiprocessor and is programmed to allow multiple
threads of computation to run concurrently.  In xv6 an interrupt might
run on one processor and a process in kernel mode may run on another
processor, sharing a kernel data structure with the interrupt routing.
xv6 uses locks, implemented using an atomic instruction, to coordinate
concurrent activities.

<p>Let's check out why xv6 needs locks by following what happens when
we start a second processor:
<ul>
<li>1516: mp_init (called from main0)
<li>1606: mp_startthem (called from main0)
<li>1302: mpmain
<li>2208: scheduler.
  <br>Now we have several processors invoking the scheduler
  function. xv6 better ensure that multiple processors don't run the
  same process!  does it?
  <br>Yes, if multiple schedulers run concurrently, only one will
  acquire proc_table_lock, and proceed looking for a runnable
  process. if it finds a process, it will mark it running, longjmps to
  it, and the process will release proc_table_lock.  the next instance
  of scheduler will skip this entry, because it is marked running, and
  look for another runnable process.
</ul>

<p>Why hold proc_table_lock during a context switch?  It protects
p->state; the process has to hold some lock to avoid a race with
wakeup() and yield(), as we will see in the next lectures.

<p>Why not a lock per proc entry?  It might be expensive in in whole
table scans (in wait, wakeup, scheduler). proc_table_lock also
protects some larger invariants, for example it might be hard to get
proc_wait() right with just per entry locks.  Right now the check to
see if there are any exited children and the sleep are atomic -- but
that would be hard with per entry locks.  One could have both, but
that would probably be neither clean nor fast.

<p>Of course, there is only processor searching the proc table if
acquire is implemented correctly. Let's check out acquire in
spinlock.c:
<ul>
<li>1807: no recursive locks!
<li>1811: why disable interrupts on the current processor? (if
interrupt code itself tries to take a held lock, xv6 will deadlock;
the panic will fire on 1808.)
<ul>
<li>can a process on a processor hold multiple locks?
</ul>
<li>1814: the (hopefully) atomic instruction.
<ul>
<li>see sheet 4, line 0468.
</ul>
<li>1819: make sure that stores issued on other processors before we
got the lock are observed by this processor.  these may be stores to
the shared data structure that is protected by the lock.
</ul>

<p>

<h2>Locking in JOS</h2>

<p>JOS is meant to run on single-CPU machines, and the plan can be
simple.  The simple plan is disabling/enabling interrupts in the
kernel (IF flags in the EFLAGS register).  Thus, in the kernel,
threads release the processors only when they want to and can ensure
that they don't release the processor during a critical section.

<p>In user mode, JOS runs with interrupts enabled, but Unix user
applications don't share data structures.  The data structures that
must be protected, however, are the ones shared in the library
operating system (e.g., pipes). In JOS we will use special-case
solutions, as you will find out in lab 6.  For example, to implement
pipe we will assume there is one reader and one writer.  The reader
and writer never update each other's variables; they only read each
other's variables.  Carefully programming using this rule we can avoid
races.
